{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "WfGpInivS0fG"
   },
   "source": [
    "<h2 align=\"center\">点击下列图标在线运行HanLP</h2>\n",
    "<div align=\"center\">\n",
    "\t<a href=\"https://colab.research.google.com/github/hankcs/HanLP/blob/doc-zh/plugins/hanlp_demo/hanlp_demo/zh/tok_stl.ipynb\" target=\"_blank\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n",
    "\t<a href=\"https://mybinder.org/v2/gh/hankcs/HanLP/doc-zh?filepath=plugins%2Fhanlp_demo%2Fhanlp_demo%2Fzh%2Ftok_stl.ipynb\" target=\"_blank\"><img src=\"https://mybinder.org/badge_logo.svg\" alt=\"Open In Binder\"/></a>\n",
    "</div>\n",
    "\n",
    "## 安装"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "IYwV-UkNNzFp"
   },
   "source": [
    "无论是Windows、Linux还是macOS，HanLP的安装只需一句话搞定："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "1Uf_u7ddMhUt"
   },
   "outputs": [],
   "source": [
    "!pip install hanlp -U"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "pp-1KqEOOJ4t"
   },
   "source": [
    "## 加载模型\n",
    "HanLP的工作流程是先加载模型，模型的标示符存储在`hanlp.pretrained`这个包中，按照NLP任务归类。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "4M7ka0K5OMWU",
    "outputId": "f931579a-f5a8-487a-a89e-33d5477584c3"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'SIGHAN2005_PKU_CONVSEG': 'https://file.hankcs.com/hanlp/tok/sighan2005-pku-convseg_20200110_153722.zip',\n",
       " 'SIGHAN2005_MSR_CONVSEG': 'https://file.hankcs.com/hanlp/tok/convseg-msr-nocrf-noembed_20200110_153524.zip',\n",
       " 'CTB6_CONVSEG': 'https://file.hankcs.com/hanlp/tok/ctb6_convseg_nowe_nocrf_20200110_004046.zip',\n",
       " 'PKU_NAME_MERGED_SIX_MONTHS_CONVSEG': 'https://file.hankcs.com/hanlp/tok/pku98_6m_conv_ngram_20200110_134736.zip',\n",
       " 'LARGE_ALBERT_BASE': 'https://file.hankcs.com/hanlp/tok/large_corpus_cws_albert_base_20211228_160926.zip',\n",
       " 'SIGHAN2005_PKU_BERT_BASE_ZH': 'https://file.hankcs.com/hanlp/tok/sighan2005_pku_bert_base_zh_20201231_141130.zip',\n",
       " 'COARSE_ELECTRA_SMALL_ZH': 'https://file.hankcs.com/hanlp/tok/coarse_electra_small_20220616_012050.zip',\n",
       " 'FINE_ELECTRA_SMALL_ZH': 'https://file.hankcs.com/hanlp/tok/fine_electra_small_20220615_231803.zip',\n",
       " 'CTB9_TOK_ELECTRA_SMALL': 'https://file.hankcs.com/hanlp/tok/ctb9_electra_small_20220215_205427.zip',\n",
       " 'CTB9_TOK_ELECTRA_BASE': 'http://download.hanlp.com/tok/extra/ctb9_tok_electra_base_20220426_111949.zip',\n",
       " 'CTB9_TOK_ELECTRA_BASE_CRF': 'http://download.hanlp.com/tok/extra/ctb9_tok_electra_base_crf_20220426_161255.zip',\n",
       " 'MSR_TOK_ELECTRA_BASE_CRF': 'http://download.hanlp.com/tok/extra/msra_crf_electra_base_20220507_113936.zip',\n",
       " 'UD_TOK_MMINILMV2L6': 'https://file.hankcs.com/hanlp/tok/ud_tok_mMiniLMv2L6_no_space_mul_20220619_091824.zip',\n",
       " 'UD_TOK_MMINILMV2L12': 'https://file.hankcs.com/hanlp/tok/ud_tok_mMiniLMv2L12_no_space_mul_20220619_091159.zip'}"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import hanlp\n",
    "hanlp.pretrained.tok.ALL # 语种见名称最后一个字段或相应语料库"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "BMW528wGNulM"
   },
   "source": [
    "调用`hanlp.load`进行加载，模型会自动下载到本地缓存。自然语言处理分为许多任务，分词只是最初级的一个。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "0tmKBu7sNAXX",
    "outputId": "8977891f-9e64-4e39-8ce6-264a791541a3"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<hanlp.components.tokenizers.transformer.TransformerTaggingTokenizer at 0x10420e5b0>"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tok = hanlp.load(hanlp.pretrained.tok.COARSE_ELECTRA_SMALL_ZH)\n",
    "tok"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 进阶知识\n",
    "你可以通过加载不同的模型实现各种颗粒度、各种分词标准、各种领域的中文分词。其中，coarse和fine模型训练自`9970`万字的大型综合语料库，覆盖新闻、社交媒体、金融、法律等多个领域，是已知范围内**全世界最大**的中文分词语料库。语料库规模决定实际效果，面向生产环境的语料库应当在千万字量级。欢迎用户在自己的语料上[训练或微调模型](https://github.com/hankcs/HanLP/tree/master/plugins/hanlp_demo/hanlp_demo/zh/train)以适应新领域。语料库标注标准决定最终的分词标准，模型的准确率决定多大程度上再现该分词标准。更多背景知识请参考[《自然语言处理入门》](http://nlp.hankcs.com/book.php)。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "KYH1oEKkctuy"
   },
   "source": [
    "## 执行分词"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "uzex--zFcqKB",
    "outputId": "a4db6808-1039-4803-84af-2687cce0fa7b"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[['商品', '和', '服务', '。'], ['晓美焰', '来到', '北京立方庭', '参观', '自然语义科技公司']]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tok(['商品和服务。', '晓美焰来到北京立方庭参观自然语义科技公司'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 细分标准"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "你可以通过加载`FINE_ELECTRA_SMALL_ZH`模型实现细粒度中文分词："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "tok_fine = hanlp.load(hanlp.pretrained.tok.FINE_ELECTRA_SMALL_ZH)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "无论哪个模型，分词器的接口是完全一致的："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['晓美焰', '来到', '北京', '立方庭', '参观', '自然', '语义', '科技', '公司']"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tok_fine('晓美焰来到北京立方庭参观自然语义科技公司')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 无限长度\n",
    "众所周知，Transformer的输入有长度限制（通常是512）。幸运地是，HanLP的滑动窗口技巧完美地突破了该限制。只要你的内存（显存）足够，HanLP就可以处理无限长的句子。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 并行分词\n",
    "无论是CPU还是GPU，同时传入多个句子都将并行分词。也就是说，仅花费1个句子的时间可以处理多个句子。然而工作研究中的文本通常是一篇文档，而不是许多句子。此时可以利用HanLP提供的分句功能和流水线模式优雅应对，既能处理长文本又能并行化。只需创建一个流水线`pipeline`，第一级管道分句，第二级管道分词："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[['量体裁衣', '，', 'HanLP', '提供', 'RESTful', '和', 'native', '两种', 'API', '。'],\n",
       " ['两者', '在', '语义', '上', '保持', '一致', '，', '在', '代码', '上', '坚持', '开源', '。']]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "HanLP = hanlp.pipeline() \\\n",
    "    .append(hanlp.utils.rules.split_sentence) \\\n",
    "    .append(tok)\n",
    "HanLP('量体裁衣，HanLP提供RESTful和native两种API。两者在语义上保持一致，在代码上坚持开源。')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "返回结果是每个句子的分词`list`，如果要将它们合并到一个`list`里该怎么办呢？聪明的用户可能已经想到了，再加一级`lambda`管道："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['量体裁衣', '，', 'HanLP', '提供', 'RESTful', '和', 'native', '两种', 'API', '。', '两者', '在', '语义', '上', '保持', '一致', '，', '在', '代码', '上', '坚持', '开源', '。']\n"
     ]
    }
   ],
   "source": [
    "HanLP.append(lambda sents: sum(sents, []))\n",
    "print(HanLP('量体裁衣，HanLP提供RESTful和native两种API。两者在语义上保持一致，在代码上坚持开源。'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "suUL042zPpLj"
   },
   "source": [
    "## 自定义词典"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "1q4MUpgVQNlu"
   },
   "source": [
    "智者千虑，必有一失。模型偶尔也会犯错误，比如某个旧版本模型在不挂词典时会犯以下错误（最新版已经修复）："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "2zZkH9tRQOoi",
    "outputId": "a74db6c6-0a71-411c-de78-60621a43eded",
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['首相', '和', '川', '普通', '电话']"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tok = hanlp.load('https://file.hankcs.com/hanlp/tok/coarse_electra_small_20220220_013548.zip')\n",
    "tok.dict_force = tok.dict_combine = None\n",
    "tok(\"首相和川普通电话\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "上面分词任务两个成员变量`dict_force`和`dict_combine`为自定义词典："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 35
    },
    "id": "AzYShIssP6kq",
    "outputId": "ce3bb1aa-5042-47d7-8ac9-7ed0fd478c77"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(None, None)"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tok.dict_combine, tok.dict_force"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "HanLP支持合并和强制两种优先级的自定义词典，以满足不同场景的需求。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "F-9gAeIVQUFG"
   },
   "source": [
    "### 强制模式\n",
    "强制模式`dict_force`优先输出正向最长匹配到的自定义词条，在这个案例中，用户的第一反应也许是将`川普`加入到`dict_force`中，强制分词器输出`川普`："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "F8M8cyBrQduw",
    "outputId": "c156513c-d13c-47f1-bc3a-c73a8649ddb1"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[['首相', '和', '川普', '通', '电话'],\n",
       " ['银', '川普', '通人', '与', '川普', '通', '电话', '讲', '四', '川普', '通话']]"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tok.dict_force = {'川普'}\n",
    "tok([\"首相和川普通电话\", \"银川普通人与川普通电话讲四川普通话\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "DDqQxqQaTayv"
   },
   "source": [
    "然而与大众的朴素认知不同，词典优先级最高未必是好事。极有可能匹配到不该分出来的自定义词语，导致歧义。即便是将`普通人`或`普通话`加入到词典中也无济于事，因为在正向最长匹配第二个句子的过程中，会匹配到`川普`而不会匹配后两者。这也解释了为什么自定义词典中存在的词可能分不出来：当歧义发生时，两个词语发生交叉冲突，自然有所取舍，无法同时输出两者。那种同时输出句子或长单词中所有可能的单词，并且允许单词交叉的算法，并非分词，而是多模式字符串匹配。你需要基本的算法知识才能理解这一点，总之一般情况下应当慎用强制模式，详见[《自然语言处理入门》](http://nlp.hankcs.com/book.php)第二章。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "自定义词语越长，越不容易发生歧义。这启发我们将强制模式拓展为强制校正功能。强制校正原理相似，但会将匹配到的自定义词条替换为相应的分词结果:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "bjnEqDaATdVr",
    "outputId": "2e694aed-a71f-4a28-d981-0767d9e263e9"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[['首相', '和', '川普', '通', '电话'],\n",
       " ['银川', '普通人', '与', '川普', '通', '电话', '讲', '四川', '普通话']]"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tok.dict_force = {'川普通电话': ['川普', '通', '电话']}\n",
    "tok([\"首相和川普通电话\", \"银川普通人与川普通电话讲四川普通话\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "强制校正是一种短平快的规则补丁，需要针对每种可能产生歧义的语境，截取一个片段执行校正。当你积累了很多歧义片段与相应的校正补丁后，其实就应该考虑微调模型。微调可以让模型增量式学习这些歧义语境，摆脱对补丁规则的依赖，同时举一反三应对新的语境。从错误中积累经验，用经验预测未来，这就是机器学习与人工智能的魅力。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "事实上，“川普通电话”这种例子不需要词典即可分对。只需提供给神经网络足够的上下文线索（这也是真实文本所具备的），告诉神经网络“川普是美国总统”："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[['首相', '和', '川普', '通', '电话', '，', '川普', '是', '美国', '总统', '。'], ['银川', '普通人', '与', '川普', '通', '电话', '讲', '四川', '普通话', '，', '川普', '是', '美国', '总统', '。']]\n"
     ]
    }
   ],
   "source": [
    "tok.dict_force = tok.dict_combine = None\n",
    "print(tok([\"首相和川普通电话，川普是美国总统。\", \"银川普通人与川普通电话讲四川普通话，川普是美国总统。\"]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "9aRzEeRvTlRr"
   },
   "source": [
    "在上面的例子中，虽然词典对“川普”没有施加任何影响，但是更丰富的上下文促进了神经网络对语境的理解，使其得出了正确的结果。深度学习中的神经网络似乎展示了些许智能，感兴趣的初学者可参考[《自然语言处理入门》](http://nlp.hankcs.com/book.php)。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ldKAnVoSTgxb"
   },
   "source": [
    "### 合并模式\n",
    "合并模式的优先级低于统计模型，即`dict_combine`会在统计模型的分词结果上执行最长匹配并合并匹配到的词条。一般情况下，推荐使用该模式。比如，将“美国总统”加入`dict_combine`后会合并`['美国', '总统']`，而不会合并`['美国', '总', '统筹部']`为`['美国总统', '筹部']`："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "bwIu0f6wTgbF",
    "outputId": "22807b6a-3472-431b-d1e3-95f6b761c84c"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[['首相', '和', '川普', '通', '电话', '，', '川普', '是', '美国总统', '。'], ['银川', '普通人', '与', '川普', '通', '电话', '讲', '四川', '普通话', '，', '川普', '是', '美国总统', '。'], ['美国', '总统筹部', '部长', '是', '谁', '？']]\n"
     ]
    }
   ],
   "source": [
    "tok.dict_force = None\n",
    "tok.dict_combine = {'美国总统'}\n",
    "print(tok([\"首相和川普通电话，川普是美国总统。\", \"银川普通人与川普通电话讲四川普通话，川普是美国总统。\", \"美国总统筹部部长是谁？\"]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 空格单词"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "含有空格、制表符等（Transformer tokenizer去掉的字符）的词语需要用`tuple`的形式提供："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['如何', '评价', 'iPad Pro', '？', 'iPad  Pro', '有', '2个空格']"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tok.dict_combine = {('iPad', 'Pro'), '2个空格'}\n",
    "tok(\"如何评价iPad Pro ？iPad  Pro有2个空格\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "聪明的用户请继续阅读，`tuple`词典中的字符串其实等价于该字符串的所有可能的切分方式："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dict_keys([('iPad', 'Pro'), ('2个空格',), ('2', '个', '空格'), ('2', '个', '空', '格'), ('2', '个空格'), ('2', '个空', '格'), ('2个', '空', '格'), ('2个', '空格'), ('2个空', '格')])"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dict(tok.dict_combine.config[\"dictionary\"]).keys()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 单词位置"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "HanLP支持输出每个单词在文本中的原始位置，以便用于搜索引擎等场景。在词法分析中，非语素字符（空格、换行、制表符等）会被剔除，此时需要额外的位置信息才能定位每个单词："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[['2021', 0, 4], ['年', 5, 6], ['HanLPv2.1', 7, 16], ['为', 17, 18], ['生产', 18, 20], ['环境', 20, 22], ['带来', 22, 24], ['次', 24, 25], ['世代', 25, 27], ['最', 27, 28], ['先进', 28, 30], ['的', 30, 31], ['多', 31, 32], ['语种', 32, 34], ['NLP', 34, 37], ['技术', 37, 39], ['。', 39, 40]]\n"
     ]
    }
   ],
   "source": [
    "tok.config.output_spans = True\n",
    "sent = '2021 年\\nHanLPv2.1 为生产环境带来次世代最先进的多语种NLP技术。'\n",
    "word_offsets = tok(sent)\n",
    "print(word_offsets)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "返回格式为三元组（单词，单词的起始下标，单词的终止下标），下标以字符级别计量。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "for word, begin, end in word_offsets:\n",
    "    assert word == sent[begin:end]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 多语种支持"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "得益于语言无关的设计，以及大规模多语种语料库，最近HanLP发布了支持[130种语言](https://hanlp.hankcs.com/docs/api/hanlp/pretrained/tok.html#hanlp.pretrained.tok.UD_TOK_MMINILMV2L12)的单任务分词器。用法与中文分词器相同："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "mul = hanlp.load(hanlp.pretrained.tok.UD_TOK_MMINILMV2L6)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[['In', '2021', ',', 'HanLPv2.1', 'delivers', 'state-of-the-art', 'multilingual', 'NLP', 'techniques', 'to', 'production', 'environments', '.'], ['2021年', '、', 'HanLPv2.1', 'は', '次世代', 'の', '最', '先端', '多', '言語', 'NLP', '技術', 'を', '本番', '環境', 'に', '導入', 'し', 'ます', '。'], ['2021年', 'HanLPv2.1', '为', '生产', '环境', '带来', '次', '世代', '最', '先进', '的', '多语种', 'NLP', '技术', '。'], ['奈須きのこ', 'は', '1973年', '11月', '28日', 'に', '千葉', '県', '円空山', 'で', '生まれ', '、', 'ゲーム', '制作', '会社', '「', 'ノーツ', '」', 'の', '設立', '者', 'だ', '。']]\n"
     ]
    }
   ],
   "source": [
    "print(mul([\n",
    "    'In 2021, HanLPv2.1 delivers state-of-the-art multilingual NLP techniques to production environments.',\n",
    "    '2021年、HanLPv2.1は次世代の最先端多言語NLP技術を本番環境に導入します。',\n",
    "    '2021年 HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。',\n",
    "    '奈須きのこは1973年11月28日に千葉県円空山で生まれ、ゲーム制作会社「ノーツ」の設立者だ。'\n",
    "]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "目前，多语种分词器的效果并不如单语种好。欢迎在你自己的单语种语料上自行训练新模型，也欢迎开源你的语料和模型。"
   ]
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "authorship_tag": "ABX9TyPxXzYAXgLUW5uKV7v0/2iP",
   "collapsed_sections": [],
   "name": "tok_stl.ipynb",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
